Stochastic Gradient Descent (SGD)#
Stochastic Gradient Descent is an optimization algorithm that differs from ordinary Gradient Descent in that, at each step, the gradient of the optimized function is computed not as the sum of the gradients over all elements of the sample, but as the gradient at a single randomly selected element.
Gradient Descent (GD) review#
Before going into the study of stochastic gradient descent, let’s recall the formula of ordinary gradient descent. We have the following:
\(\boldsymbol w = {(w_1, w_2, ..., w_n)}^{T}\) - the vector of parameters, where \(n\) is its dimension
\(L\) - a differentiable multivariable function
\(\nabla L(\boldsymbol w)\) - the gradient of the function (its meaning is shown in (1))
\(Q(\boldsymbol w)\) - the function we want to minimize with respect to \(\boldsymbol w\), i.e. \(Q(\boldsymbol w) = \sum\limits_{i=1}^\ell L_i(\boldsymbol w) \to \min\limits_{\boldsymbol w}\)
\(\ell\) - the number of input points (training samples)
Choose an initial approximation \(\boldsymbol w^{(0)}\). Then, each subsequent vector of parameters is calculated as
where \(\eta\) is the gradient step, or learning rate (it controls how much the vector of parameters changes in the direction of the gradient)
The algorithm stops when \(Q\) or \(\boldsymbol w\) converges.
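The update rule above can be sketched in Python. The objective, step size, tolerance, and iteration cap below are illustrative assumptions, not values fixed by the text:

```python
import numpy as np

def gradient_descent(grad_Q, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Plain (full-batch) gradient descent: w <- w - eta * grad_Q(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_Q(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # stop when w has converged
            break
    return w

# Example: minimize Q(w) = w1^2 + w2^2, whose gradient is 2w.
w_opt = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0])
```
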
Stochastic Gradient Descent (SGD)#
GD algorithm problem#
Considering the previous algorithm (5), we run into a problem: to determine a new approximation of the vector \(\boldsymbol w\), it is necessary to calculate the gradient of every element, i.e. \( \sum\limits_{i=1}^\ell \nabla L_i(\boldsymbol w)\). This can significantly slow down the algorithm, since it is very resource-intensive.
SGD derivation#
To improve the efficiency of the algorithm, we can replace the calculation of the true gradient over all input points with a pseudogradient computed from a single point or a small sequence of points.
Pseudogradient (\(\nabla \widehat Q\)) must have the following properties:
It must form an acute angle with the true gradient in \(n\)-dimensional space
It must be much cheaper to compute than the true gradient \(\nabla Q(\boldsymbol w)\) over all input points.
Accordingly, at each step, instead of calculating the whole sum of gradients over all inputs (\(\sum\limits_{i=1}^{\ell} \nabla L_i(\boldsymbol w)\)), a single observation can be considered (\(\nabla L_k(\boldsymbol w)\)), i.e.
where \(1 \leqslant k \leqslant \ell\) (usually chosen at random)
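The single-observation update can be illustrated on a toy least-squares problem. The data, loss \(L_k\), step size, and iteration count below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: ell points, each contributing L_i(w) = (x_i @ w - y_i)^2.
ell, n = 100, 3
X = rng.normal(size=(ell, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def grad_L(w, k):
    """Gradient of the single-sample loss L_k(w) = (x_k @ w - y_k)^2."""
    return 2 * (X[k] @ w - y[k]) * X[k]

w = np.zeros(n)
eta = 0.02
for _ in range(5000):
    k = rng.integers(ell)        # random index (0-based here)
    w = w - eta * grad_L(w, k)   # pseudogradient step on one observation
```

Because the data are noiseless, every per-sample gradient vanishes at \(\boldsymbol w_{\text{true}}\), so the iterates settle on the true parameters.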
Did you know that…
“SGD” is not only an abbreviation for Stochastic Gradient Descent. It is also the currency code of the Singapore dollar, which looks like this

By the way, the exchange rate of 1 Singapore dollar (SGD) at the time of writing this topic is 344.38 tenge (KZT)
SGD Realization Pseudocode#
Algorithm (Stochastic Gradient descent)
Input: Convergence step (learning rate) “\(\eta\)”, rate of “forgetting” “\(\lambda\)”
Output: Vector of parameters \(\boldsymbol w\)
To solve the optimization problem
do the following steps:
Initialization of \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))
Initial calculation of the function: \(\widehat Q (\boldsymbol w) = \frac 1\ell \sum\limits_{i=1}^\ell L_i(\boldsymbol w)\)
While \(\widehat Q\) and/or \(\boldsymbol w\) don’t reach set values DO:
Choose a random index value \(k\), where \(1 \leqslant k \leqslant \ell\)
Calculate function \(E_k = L_k(\boldsymbol w)\)
Recalculate vector: \(\boldsymbol w = \boldsymbol w - \eta \nabla L_k(\boldsymbol w)\)
Recalculate function: \(\widehat Q = \lambda E_k + (1-\lambda)\widehat Q\) (the rate of “forgetting” \(\lambda\) is just a constant which is commonly very small)
return “\(\boldsymbol w\)”
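The pseudocode above can be sketched in Python. The toy losses \(L_i\), the constants, and the iteration cap are illustrative assumptions; with a constant \(\eta\) the iterates hover near the optimum rather than landing on it exactly:

```python
import numpy as np

rng = np.random.default_rng(42)

def sgd(L, grad_L, w0, ell, eta=0.1, lam=0.01, max_iter=2000):
    """SGD with an exponential moving average Q_hat of the loss
    (lam is the 'forgetting' rate from the pseudocode above)."""
    w = np.asarray(w0, dtype=float)
    Q_hat = np.mean([L(w, i) for i in range(ell)])  # initial full estimate
    for _ in range(max_iter):
        k = rng.integers(ell)                # random index
        e_k = L(w, k)                        # E_k = L_k(w)
        w = w - eta * grad_L(w, k)           # parameter update
        Q_hat = lam * e_k + (1 - lam) * Q_hat  # smooth the loss estimate
    return w, Q_hat

# Toy problem: ell losses L_i(w) = ||w - c_i||^2 with random centers c_i;
# the minimizer of their sum is the mean of the centers.
ell = 50
C = rng.normal(size=(ell, 2))
L = lambda w, i: np.sum((w - C[i]) ** 2)
grad_L = lambda w, i: 2 * (w - C[i])

w_fit, q = sgd(L, grad_L, w0=np.zeros(2), ell=ell)
```
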
Q: How was the function \(\widehat Q = \lambda E_k + (1-\lambda)\widehat Q\) derived?
Hint
Use the formula for a simple average:
Then try to express the same formula for \(Q_{\ell - 1}\) and think how to combine them.
Assume that, \(\lambda = \frac 1{\ell}\)
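Working through the hint (try it yourself first), the simple average can be unrolled into a recursive form:

```latex
Q_\ell = \frac{1}{\ell}\sum_{i=1}^{\ell} E_i
       = \frac{1}{\ell}E_\ell + \frac{1}{\ell}\sum_{i=1}^{\ell-1} E_i
       = \frac{1}{\ell}E_\ell + \frac{\ell-1}{\ell}\,Q_{\ell-1}
       = \lambda E_\ell + (1-\lambda)\,Q_{\ell-1},
\qquad \lambda = \frac{1}{\ell}
```

Keeping \(\lambda\) as a free constant instead of \(\frac 1\ell\) turns the running average into an exponential moving average.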
SGD Exercise:#
Given:
\(\boldsymbol w = (w_1, w_2, w_3) ^ T = (1, -1, 0.5) ^ T\)
\(\eta = 0.1\)
\(\lambda = 0.8\)
\(L(\boldsymbol w) = w_1^{2} + w_2^{2} + 0.5 w_3^{2}\)
What are the parameters of \(\boldsymbol w\) in the next iteration?
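After solving by hand, one way to check the answer is a single update step in NumPy (the gradient below is derived from the given \(L\)):

```python
import numpy as np

w = np.array([1.0, -1.0, 0.5])
eta = 0.1

# Gradient of L(w) = w1^2 + w2^2 + 0.5*w3^2 is (2*w1, 2*w2, w3).
grad = np.array([2 * w[0], 2 * w[1], w[2]])

w_next = w - eta * grad
```
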
Mini Batch GD#
Mini batch gradient descent is actually a mixture of Batch Gradient Descent and SGD. It computes the gradient of the function over small random subsets of the elements (mini batches).
Algorithm (Mini Batch GD)
To solve the optimization problem:
do the following steps:
Initialization of \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))
Initial calculation of the function: \(\widehat Q (\boldsymbol w) = \frac 1\ell \sum\limits_{i=1}^\ell L_i(\boldsymbol w)\)
While \(\widehat Q\) and/or \(\boldsymbol w\) don’t reach set values DO:
Choose a subset of \(m\) random indexes \(k\), i.e. \(B = \{k_1, k_2, ..., k_m\}\), where \(1 \leqslant m < \ell\)
Calculate function \(E_{B}(\boldsymbol w) = \sum\limits_{i \in B} L_i(\boldsymbol w)\)
Recalculate vector: \(\boldsymbol w = \boldsymbol w - \eta \sum\limits_{i \in B} \nabla L_i(\boldsymbol w)\)
Recalculate function: \(\widehat Q = \lambda E_{B}(\boldsymbol w) + (1-\lambda)\widehat Q\)
return \(\boldsymbol w\)
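The mini-batch loop can be sketched on the same kind of toy least-squares data; the batch size \(m\), step size, and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares data: L_i(w) = (x_i @ w - y_i)^2.
ell, n, m = 200, 3, 16          # m is the mini-batch size, 1 <= m < ell
X = rng.normal(size=(ell, n))
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true

w = np.zeros(n)
eta = 0.01                      # small, since the gradient is a SUM over the batch
for _ in range(3000):
    B = rng.choice(ell, size=m, replace=False)  # random subset of indexes
    residual = X[B] @ w - y[B]
    grad = 2 * X[B].T @ residual                # sum_{i in B} grad L_i(w)
    w = w - eta * grad
```

Averaging the batch gradient instead of summing it (dividing by \(m\)) is a common variant that makes \(\eta\) less sensitive to the batch size.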
Mini Batch GD VS SGD#
Heuristics#
Here, “heuristics” means a set of techniques and methods that facilitate and simplify the solution of the corresponding tasks. Below is a list of heuristics suggested when working with SGD:
Initialization of the vector \(\boldsymbol w\)
Choice for learning rate \(\eta\)
Initialization of the vector \(\boldsymbol w\)#
One of the crucial steps of the SGD algorithm is the initialization of \(\boldsymbol w\). It is common to use the zero vector as the initial value, but there are other approaches as well.
So, \(\boldsymbol w = 0\);
Another way is to take small random values of \(w\), since large initial values can put us in an area with small (in absolute value) derivatives, from which it is difficult to get out.
So, \(w_j\) = random(\(-\frac 1{2n}\), \(\frac 1{2n}\)) (the denominator contains \(2n\) because with \(n\) alone, in the case of \(n = 1\), the bound would be 1, which is an inappropriate value; with \(2n\) it is at most 0.5.)
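This small-random-values initialization is a one-liner with NumPy (the dimension below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n):
    """Small random init: each w_j ~ Uniform(-1/(2n), 1/(2n))."""
    return rng.uniform(-1 / (2 * n), 1 / (2 * n), size=n)

# Example: a 10-dimensional parameter vector, bounded by +-1/20.
w0 = init_weights(10)
```
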
Dynamic Learning Rate \(\eta\)#
As an optimization, the learning rate \(\eta\) can be changed dynamically at each iteration: replace \(\eta\) with a time-dependent learning rate \(\eta(t)\). Below is a list of some commonly used formulas for \(\eta(t)\).
Note
If \(\eta(t)\) decreases too quickly, we may stop optimizing prematurely. If it decreases too slowly, we waste too much time on optimization.
Constant learning rate VS Dynamic learning rate#
First, look at the SGD with \(\eta = 0.1\)
Next, compare the previous graph to the SGD with an exponential learning rate, \(\eta(t)=\eta_0 \cdot e^{-\lambda t}\;\) (\(\lambda = 0.1\))
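The exponential schedule is easy to express directly; \(\eta_0 = 0.1\) and \(\lambda = 0.1\) below match the graph's settings, the rest is an illustrative sketch:

```python
import numpy as np

def eta_exp(t, eta0=0.1, lam=0.1):
    """Exponentially decaying rate: eta(t) = eta0 * exp(-lam * t).
    Large steps early in training, fine steps near convergence."""
    return eta0 * np.exp(-lam * t)

# The step size shrinks smoothly with the iteration number t.
steps = [eta_exp(t) for t in range(5)]
```
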
Advantages and Disadvantages#
As any algorithm, the SGD has its own pros and cons.
Advantages:
Simple to implement
If the function \(L\) is not differentiable, it can be approximated by a differentiable one
Suitable for problems with a huge amount of data (input points)
Disadvantages:
It can easily get stuck in local optima
It has strong fluctuations
Optimizers of Gradient Algorithms#
The main disadvantage of gradient algorithms is getting stuck in local optima (minima).
Moreover, at such points the derivative is zero, which means that no further change of the parameter vector \(\boldsymbol w\) according to formula (6) ( \(\nabla Q(w) = \nabla L_k(w, x_k) = 0\) ) will happen (we will be stuck).
Consider the graph below to see the algorithm getting stuck in a local optimum.
In addition, stochastic gradient algorithms can exhibit strong fluctuations when moving toward the minimum point of the function.
Several solutions have been proposed for the problems above. See below.
Momentum Gradient Descent#
It was proposed to average the gradient in steps using the formula:
where, \(\boldsymbol v\) - accumulates the average of the gradients
\(\;\;\;\;\;\;\;\;\;\gamma\) - a parameter (using it, you can adjust how many past gradients are taken into account in the value of \(\boldsymbol v\))
Note
if we rewrite \((1-\gamma)\eta_t\) from the formula {eq}`momentum-formula` as \(\eta_t = (1-\gamma)\cdot\eta_t\), we obtain the algorithm: \(\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\)
Finally, just change the \(\boldsymbol w\) using the familiar principle:
where, \(\boldsymbol v\) - smoothed gradients
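Putting the two momentum formulas together gives a short sketch; the quadratic objective and the constants \(\eta\), \(\gamma\) below are illustrative assumptions:

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, gamma=0.9):
    """One momentum update: v accumulates past gradients, then shifts w."""
    v = gamma * v + eta * grad   # eta here absorbs the (1 - gamma) factor
    w = w - v
    return w, v

# Minimize Q(w) = ||w||^2 (gradient 2w) from a fixed starting point.
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, v = momentum_step(w, v, grad=2 * w)
```
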
How does \(\gamma\) influence the number of last gradients averaged?
The expression below shows approximately how many of the last gradients are taken into account in the calculation of \(\boldsymbol v\): \(N \approx \frac 2 {(1-\gamma)} + 1\)
It is an approximate formula because the exponential moving average, in fact, averages all the gradients encountered so far.
NAG - Nesterov’s accelerated gradient#
Taking into account the already mentioned formula in momentum:
where, for simplicity, \(\eta_t\) absorbs the factor \((1-\gamma)\), i.e. \(\eta_t = (1-\gamma)\cdot\eta_t\)
Now, we can represent the “\(\boldsymbol v\)” as the sum of the vectors “\(\gamma \boldsymbol v\)” and “\(\eta_t \cdot \nabla Q(\boldsymbol w)\)”.
But at the moment of calculating the gradient, we already know that we will shift by the value of \(\gamma \boldsymbol v\). Therefore, it is more correct to calculate the gradient not at the point \(\boldsymbol w\), but at the point \(\boldsymbol w - \gamma \boldsymbol v\):
A detailed step-by-step visualization of the NAG algorithm applied to the quadratic function \(f(x) = x^2\) is shown below. The look-ahead step is also shown.
Mathematically NAG can be expressed as:
Such an approach, as a rule, shows better convergence to the optimum point.
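The only change from the momentum sketch is where the gradient is evaluated: at the look-ahead point \(\boldsymbol w - \gamma \boldsymbol v\). The toy quadratic and constants are again illustrative assumptions:

```python
import numpy as np

def nag_step(w, v, grad_Q, eta=0.1, gamma=0.9):
    """Nesterov update: gradient taken at the look-ahead point w - gamma*v."""
    v = gamma * v + eta * grad_Q(w - gamma * v)
    return w - v, v

# Same toy objective as in the momentum example: Q(w) = ||w||^2.
grad_Q = lambda w: 2 * w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    w, v = nag_step(w, v, grad_Q)
```
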